{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f76d45d2",
   "metadata": {},
   "source": [
    "### Extract Structured Data From Text: Expert Mode (Using Kor)\n",
    "\n",
    "For complicated data extraction you need a robust library. The [Kor Library](https://eyurtsev.github.io/kor/nested_objects.html) (created by [Eugene Yurtsev](https://twitter.com/veryboldbagel)) is an awesome tool just for this.\n",
    "\n",
    "We are going to explore using Kor with a practical use case.\n",
    "\n",
    "**Why is this important?**\n",
    "LLMs are great at text output, but they need extra help outputing information in a structure that we want. A common request from developers is to get JSON data back from our LLMs.\n",
    "\n",
    "Spoiler: Jump down to the bottom to see a bonefied business idea that you can start and manage today."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7764b585",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Kor!\n",
    "from kor.extraction import create_extraction_chain\n",
    "from kor.nodes import Object, Text, Number\n",
    "\n",
    "# LangChain Models\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.llms import OpenAI\n",
    "\n",
    "# Standard Helpers\n",
    "import pandas as pd\n",
    "import requests\n",
    "import time\n",
    "import json\n",
    "from datetime import datetime\n",
    "\n",
    "# Text Helpers\n",
    "from bs4 import BeautifulSoup\n",
    "from markdownify import markdownify as md\n",
    "\n",
    "# For token counting\n",
    "from langchain.callbacks import get_openai_callback\n",
    "\n",
    "def printOutput(output):\n",
    "    print(json.dumps(output,sort_keys=True, indent=3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e1728e00",
   "metadata": {},
   "outputs": [],
   "source": [
    "# It's better to do this an environment variable but putting it in plain text for clarity\n",
    "openai_api_key = 'your_api_key'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "7f781711",
   "metadata": {
    "hide_input": false
   },
   "outputs": [],
   "source": [
    "openai_api_key = '...'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29c41e97",
   "metadata": {},
   "source": [
    "Let's start off by creating our LLM. We're using gpt4 to take advantage of its increased ability to follow instructions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f0060ded",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatOpenAI(\n",
    "#     model_name=\"gpt-3.5-turbo\", # Cheaper but less reliable\n",
    "    model_name=\"gpt-4\",\n",
    "    temperature=0,\n",
    "    max_tokens=2000,\n",
    "    openai_api_key=openai_api_key\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b0786a3",
   "metadata": {},
   "source": [
    "### Kor Hello World Example\n",
    "\n",
    "Create an object that holds information about the fields you'd like to extract"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "496d7d5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "person_schema = Object(\n",
    "    # This what will appear in your output. It's what the fields below will be nested under.\n",
    "    # It should be the parent of the fields below. Usually it's singular (not plural)\n",
    "    id=\"person\",\n",
    "    \n",
    "    # Natural language description about your object\n",
    "    description=\"Personal information about a person\",\n",
    "    \n",
    "    # Fields you'd like to capture from a piece of text about your object.\n",
    "    attributes=[\n",
    "        Text(\n",
    "            id=\"first_name\",\n",
    "            description=\"The first name of a person.\",\n",
    "        )\n",
    "    ],\n",
    "    \n",
    "    # Examples help go a long way with telling the LLM what you need\n",
    "    examples=[\n",
    "        (\"Alice and Bob are friends\", [{\"first_name\": \"Alice\"}, {\"first_name\": \"Bob\"}])\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6ae5bd8",
   "metadata": {},
   "source": [
    "Create a chain that will extract the information and then parse it. This uses LangChain under the hood"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "00e3c457",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain = create_extraction_chain(llm, person_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "109ba12a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"person\": [\n",
      "      {\n",
      "         \"first_name\": \"Bobby\"\n",
      "      },\n",
      "      {\n",
      "         \"first_name\": \"Rachel\"\n",
      "      },\n",
      "      {\n",
      "         \"first_name\": \"Joe\"\n",
      "      }\n",
      "   ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "text = \"\"\"\n",
    "    My name is Bobby.\n",
    "    My sister's name is Rachel.\n",
    "    My brother's name Joe. My dog's name is Spot\n",
    "\"\"\"\n",
    "output = chain.predict_and_parse(text=(text))[\"data\"]\n",
    "\n",
    "printOutput(output)\n",
    "# Notice how there isn't \"spot\" in the results list because it's the name of a dog, not a person."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "210c4f98",
   "metadata": {},
   "source": [
    "Kor also facilitates returning None when the LLM doesn't find what you're looking for"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b60e5f8d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"person\": []\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "output = chain.predict_and_parse(text=(\"The dog went to the park\"))[\"data\"]\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbe68a3d",
   "metadata": {},
   "source": [
    "### Multiple Fields\n",
    "You can pass multiple fields if you're looking for more information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b48aeaf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "plant_schema = Object(\n",
    "    id=\"plant\",\n",
    "    description=\"Information about a plant\",\n",
    "    \n",
    "    # Notice I put multiple fields to pull out different attributes\n",
    "    attributes=[\n",
    "        Text(\n",
    "            id=\"plant_type\",\n",
    "            description=\"The common name of the plant.\"\n",
    "        ),\n",
    "        Text(\n",
    "            id=\"color\",\n",
    "            description=\"The color of the plant\"\n",
    "        ),\n",
    "        Number(\n",
    "            id=\"rating\",\n",
    "            description=\"The rating of the plant.\"\n",
    "        )\n",
    "    ],\n",
    "    examples=[\n",
    "        (\n",
    "            \"Roses are red, lilies are white and a 8 out of 10.\",\n",
    "            [\n",
    "                {\"plant_type\": \"Roses\", \"color\": \"red\"},\n",
    "                {\"plant_type\": \"Lily\", \"color\": \"white\", \"rating\" : 8},\n",
    "            ],\n",
    "        )\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d662e926",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"plant\": [\n",
      "      {\n",
      "         \"color\": \"brown\",\n",
      "         \"plant_type\": \"Palm tree\",\n",
      "         \"rating\": \"6.0\"\n",
      "      },\n",
      "      {\n",
      "         \"color\": \"green\",\n",
      "         \"plant_type\": \"Sequoia tree\",\n",
      "         \"rating\": \"\"\n",
      "      }\n",
      "   ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "text=\"Palm trees are brown with a 6 rating. Sequoia trees are green\"\n",
    "\n",
    "chain = create_extraction_chain(llm, plant_schema)\n",
    "output = chain.predict_and_parse(text=text)['data']\n",
    "\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87e0a551",
   "metadata": {},
   "source": [
    "### Working With Lists\n",
    "\n",
    "You can also extract lists as well.\n",
    "\n",
    "Note: Check out how I have a nested object. The 'parts' object is in the 'cars_schema'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "f51a975f",
   "metadata": {},
   "outputs": [],
   "source": [
    "parts = Object(\n",
    "    id=\"parts\",\n",
    "    description=\"A single part of a car\",\n",
    "    attributes=[\n",
    "        Text(id=\"part\", description=\"The name of the part\")\n",
    "    ],\n",
    "    examples=[\n",
    "        (\n",
    "            \"the jeep has wheels and windows\",\n",
    "            [\n",
    "                {\"part\": \"wheel\"},\n",
    "                {\"part\": \"window\"}\n",
    "            ],\n",
    "        )\n",
    "    ]\n",
    ")\n",
    "\n",
    "cars_schema = Object(\n",
    "    id=\"car\",\n",
    "    description=\"Information about a car\",\n",
    "    examples=[\n",
    "        (\n",
    "            \"the bmw is red and has an engine and steering wheel\",\n",
    "            [\n",
    "                {\"type\": \"BMW\", \"color\": \"red\", \"parts\" : [\"engine\", \"steering wheel\"]}\n",
    "            ],\n",
    "        )\n",
    "    ],\n",
    "    attributes=[\n",
    "        Text(\n",
    "            id=\"type\",\n",
    "            description=\"The make or brand of the car\"\n",
    "        ),\n",
    "        Text(\n",
    "            id=\"color\",\n",
    "            description=\"The color of the car\"\n",
    "        ),\n",
    "        parts\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ca4d0cdb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"car\": {\n",
      "      \"color\": \"blue\",\n",
      "      \"parts\": [\n",
      "         {\n",
      "            \"part\": \"rear view mirror\"\n",
      "         },\n",
      "         {\n",
      "            \"part\": \"roof\"\n",
      "         },\n",
      "         {\n",
      "            \"part\": \"windshield\"\n",
      "         }\n",
      "      ],\n",
      "      \"type\": \"jeep\"\n",
      "   }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "# To do nested objects you need to specify encoder_or_encoder_class=\"json\"\n",
    "text = \"The blue jeep has rear view mirror, roof, windshield\"\n",
    "\n",
    "# Changed the encoder to json\n",
    "chain = create_extraction_chain(llm, cars_schema, encoder_or_encoder_class=\"json\")\n",
    "output = chain.predict_and_parse(text=text)['data']\n",
    "\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcdac769",
   "metadata": {},
   "source": [
    "View the prompt that was sent over"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "544e11fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.\n",
      "\n",
      "```TypeScript\n",
      "\n",
      "car: { // Information about a car\n",
      " type: string // The make or brand of the car\n",
      " color: string // The color of the car\n",
      " parts: { // A single part of a car\n",
      "  part: string // The name of the part\n",
      " }\n",
      "}\n",
      "```\n",
      "\n",
      "\n",
      "Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.\n",
      "\n",
      "Input: the bmw is red and has an engine and steering wheel\n",
      "Output: <json>{\"car\": [{\"type\": \"BMW\", \"color\": \"red\", \"parts\": [\"engine\", \"steering wheel\"]}]}</json>\n",
      "Input: the jeep has wheels and windows\n",
      "Output: <json>{\"car\": {\"parts\": [{\"part\": \"wheel\"}, {\"part\": \"window\"}]}}</json>\n",
      "Input: The blue jeep has rear view mirror, roof, windshield\n",
      "Output:\n"
     ]
    }
   ],
   "source": [
    "prompt = chain.prompt.format_prompt(text=text).to_string()\n",
    "\n",
    "print(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66d9388f",
   "metadata": {},
   "source": [
    "Kor is a really great way to extract actions from a user as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "801b200b",
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = Object(\n",
    "  id=\"forecaster\",\n",
    "  description=(\n",
    "      \"User is controling an app that makes financial forecasts. \"\n",
    "      \"They will give a command to update a forecast in the future\"\n",
    "  ),\n",
    "  attributes=[\n",
    "      Text(\n",
    "          id=\"year\",\n",
    "          description=\"Year the user wants to update\",\n",
    "          examples=[(\"please increase 2014's customers by 15%\", \"2014\")],\n",
    "          many=True,\n",
    "      ),\n",
    "      Text(\n",
    "          id=\"metric\",\n",
    "          description=\"The unit or metric a user would like to influence\",\n",
    "          examples=[(\"please increase 2014's customers by 15%\", \"customers\")],\n",
    "          many=True,\n",
    "      ),\n",
    "      Text(\n",
    "          id=\"amount\",\n",
    "          description=\"The quantity of a forecast adjustment\",\n",
    "          examples=[(\"please increase 2014's customers by 15%\", \".15\")],\n",
    "          many=True,\n",
    "      )\n",
    "    ],\n",
    "  many=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "02f45b66",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"forecaster\": {\n",
      "      \"amount\": [\n",
      "         \"15\"\n",
      "      ],\n",
      "      \"metric\": [\n",
      "         \"units sold\"\n",
      "      ],\n",
      "      \"year\": [\n",
      "         \"2023\"\n",
      "      ]\n",
      "   }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')\n",
    "output = chain.predict_and_parse(text=\"please add 15 more units sold to 2023\")['data']\n",
    "\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75568586",
   "metadata": {},
   "source": [
    "## Opening Attributes - Real World Example\n",
    "\n",
    "[Opening Attributes](https://twitter.com/GregKamradt/status/1643027796850253824) (my sample project for this application)\n",
    "\n",
    "If anyone wants to strategize on this project DM me on twitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "0777cb73",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatOpenAI(\n",
    "    # model_name=\"gpt-3.5-turbo\",\n",
    "    model_name=\"gpt-4\",\n",
    "    temperature=0,\n",
    "    max_tokens=2000,\n",
    "    openai_api_key=openai_api_key\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4932a241",
   "metadata": {},
   "source": [
    "We are going to be pulling jobs from Greenhouse. No API key is needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3cb6525f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def pull_from_greenhouse(board_token):\n",
    "    # If doing this in production, make sure you do retries and backoffs\n",
    "    \n",
    "    # Get your URL ready to accept a parameter\n",
    "    url = f'https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true'\n",
    "    \n",
    "    try:\n",
    "        response = requests.get(url)\n",
    "    except:\n",
    "        # In case it doesn't work\n",
    "        print (\"Whoops, error\")\n",
    "        return\n",
    "        \n",
    "    status_code = response.status_code\n",
    "    \n",
    "    jobs = response.json()['jobs']\n",
    "    \n",
    "    print (f\"{board_token}: {status_code}, Found {len(jobs)} jobs\")\n",
    "    \n",
    "    return jobs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "636e4e8b",
   "metadata": {},
   "source": [
    "Let's try it out for [Okta](https://www.okta.com/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "d2b794b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "okta: 200, Found 142 jobs\n"
     ]
    }
   ],
   "source": [
    "jobs = pull_from_greenhouse(\"okta\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10c8651d",
   "metadata": {},
   "source": [
    "Let's look at a sample job with it's raw dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "1a458cb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep in mind that my job_ids will likely change when you run this depending on the postings of the company\n",
    "job_index = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b2b394e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Preview:\n",
      "\n",
      "{\"absolute_url\": \"https://www.okta.com/company/careers/opportunity/4977199?gh_jid=4977199\", \"data_compliance\": [{\"type\": \"gdpr\", \"requires_consent\": false, \"requires_processing_consent\": false, \"requires_retention_consent\": false, \"retention_period\": null}], \"internal_job_id\": 2518868, \"location\": {\"name\": \"Melbourne \"}, \"metadata\": null, \"id\": 4977199, \"updated_at\": \"2023-04-05T22:41:12-04:00\", \"\n"
     ]
    }
   ],
   "source": [
    "print (\"Preview:\\n\")\n",
    "print (json.dumps(jobs[job_index])[:400])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5662c3cb",
   "metadata": {},
   "source": [
    "Let's clean this up a bit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "ce0ee96a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I parsed through an output to create the function below\n",
    "def describeJob(job_description):\n",
    "    print(f\"Job ID: {job_description['id']}\")\n",
    "    print(f\"Link: {job_description['absolute_url']}\")\n",
    "    print(f\"Updated At: {datetime.fromisoformat(job_description['updated_at']).strftime('%B %-d, %Y')}\")\n",
    "    print(f\"Title: {job_description['title']}\\n\")\n",
    "    print(f\"Content:\\n{job_description['content'][:550]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72759972",
   "metadata": {},
   "source": [
    "We'll look at another job. This job_id may or may not work for you depending on if the position is still active."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "12a43cea",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Job ID: 4982726\n",
      "Link: https://www.okta.com/company/careers/opportunity/4982726?gh_jid=4982726\n",
      "Updated At: April 11, 2023\n",
      "Title: Staff Software Engineer \n",
      "\n",
      "Content:\n",
      "&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;strong&gt;Get to know Okta&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;\n",
      "&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;br&gt;&lt;/span&gt;Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at t\n"
     ]
    }
   ],
   "source": [
    "# Note: I'm using a hard coded job id below. You'll need to switch this if this job ever changes\n",
    "# and it most definitely will!\n",
    "job_id = 4982726\n",
    "\n",
    "job_description = [item for item in jobs if item['id'] == job_id][0]\n",
    "    \n",
    "describeJob(job_description)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2a3a009",
   "metadata": {},
   "source": [
    "I want to convert the html to text, we'll use BeautifulSoup to do this. There are multiple methods you could choose from. Pick what's best for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "bd3e15c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "soup = BeautifulSoup(job_description['content'], 'html.parser')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "5a991fa4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Get to know Okta**\n",
      "\n",
      "\n",
      "  \n",
      "Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.   \n",
      "  \n",
      "At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box, we’re looking for lifelong learners and people who can make us better with their unique experien\n"
     ]
    }
   ],
   "source": [
    "text = soup.get_text()\n",
    "\n",
    "# Convert your html to markdown. This reduces tokens and noise\n",
    "text = md(text)\n",
    "\n",
    "print (text[:600])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f673bc2",
   "metadata": {},
   "source": [
    "Let's create a Kor object that will look for tools. This is the meat and potatoes of the application"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "ee906edf",
   "metadata": {},
   "outputs": [],
   "source": [
    "tools = Object(\n",
    "    id=\"tools\",\n",
    "    description=\"\"\"\n",
    "        A tool, application, or other company that is listed in a job description.\n",
    "        Analytics, eCommerce and GTM are not tools\n",
    "    \"\"\",\n",
    "    attributes=[\n",
    "        Text(\n",
    "            id=\"tool\",\n",
    "            description=\"The name of a tool or company\"\n",
    "        )\n",
    "    ],\n",
    "    examples=[\n",
    "        (\n",
    "            \"Experience in working with Netsuite, or Looker a plus.\",\n",
    "            [\n",
    "                {\"tool\": \"Netsuite\"},\n",
    "                {\"tool\": \"Looker\"},\n",
    "            ],\n",
    "        ),\n",
    "        (\n",
    "           \"Experience with Microsoft Excel\",\n",
    "            [\n",
    "               {\"tool\": \"Microsoft Excel\"}\n",
    "            ] \n",
    "        ),\n",
    "        (\n",
    "           \"You must know AWS to do well in the job\",\n",
    "            [\n",
    "               {\"tool\": \"AWS\"}\n",
    "            ] \n",
    "        ),\n",
    "        (\n",
    "           \"Troubleshooting customer issues and debugging from logs (Splunk, Syslogs, etc.) \",\n",
    "            [\n",
    "               {\"tool\": \"Splunk\"},\n",
    "            ] \n",
    "        )\n",
    "    ],\n",
    "    many=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "fd0074a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain = create_extraction_chain(llm, tools, input_formatter=\"triple_quotes\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "f8015920",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"tools\": [\n",
      "      {\n",
      "         \"tool\": \"Okta\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"Java\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"Hibernate\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"Spring Boot\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"AWS\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"GCP\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"SQL\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"ElasticSearch\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"Docker\"\n",
      "      },\n",
      "      {\n",
      "         \"tool\": \"Kubernetes\"\n",
      "      }\n",
      "   ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "output = chain.predict_and_parse(text=text)[\"data\"]\n",
    "\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09a7e59d",
   "metadata": {},
   "source": [
    "### Salary\n",
    "\n",
    "Let's grab salary information while we are at it.\n",
    "\n",
    "Not all jobs will list this information. If they do, it's rarely consistent across jobs. A great use case for LLMs to catch this information!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "8e93e9af",
   "metadata": {},
   "outputs": [],
   "source": [
    "salary_range = Object(\n",
    "    id=\"salary_range\",\n",
    "    description=\"\"\"\n",
    "        The range of salary offered for a job mentioned in a job description\n",
    "    \"\"\",\n",
    "    attributes=[\n",
    "        Number(\n",
    "            id=\"low_end\",\n",
    "            description=\"The low end of a salary range\"\n",
    "        ),\n",
    "        Number(\n",
    "            id=\"high_end\",\n",
    "            description=\"The high end of a salary range\"\n",
    "        )\n",
    "    ],\n",
    "    examples=[\n",
    "        (\n",
    "            \"This position will make between $140 thousand and $230,000.00\",\n",
    "            [\n",
    "                {\"low_end\": 140000, \"high_end\": 230000},\n",
    "            ]\n",
    "        )\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0d0045ba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cruise: 200, Found 219 jobs\n"
     ]
    }
   ],
   "source": [
    "jobs = pull_from_greenhouse(\"cruise\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "9f514a5e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Job ID: 4858414\n",
      "Link: https://boards.greenhouse.io/cruise/jobs/4858414?gh_jid=4858414\n",
      "Updated At: April 12, 2023\n",
      "Title: Senior Data Center Technician\n",
      "\n",
      "Content:\n",
      "&lt;div class=&quot;content-intro&quot;&gt;&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We&#39;re Cruise, a self-driving service designed for the cities we love.&lt;/span&gt;&lt;/p&gt;\n",
      "&lt;p&gt;&lt;span style=&quot;font-weight: 400;&quot;&gt;We’re building the world’s most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.&lt;/span&gt;\n",
      "We're Cruise, a self-driving service designed for the cities we love.\n",
      "\n",
      "\n",
      "We’re building the world’s most advanced self-driving vehicles to safely connect people to the places, things, and experiences they care about. We believe self-driving vehicles will help save lives, reshape cities, give back time in transit, and restore freedom of movement for many.\n",
      "\n",
      "\n",
      "In our cars, you’re free to be yourself. It’s the same here at Cruise. We’re creating a culture that values the experiences and contributions of all of the unique individuals who collectively make up Cruise, so that every employee can do thei\n"
     ]
    }
   ],
   "source": [
    "# This job id may not work for you, pick another one from the list if it doesn't.\n",
    "job_id = 4858414\n",
    "\n",
    "job_description = [item for item in jobs if item['id'] == job_id][0]\n",
    "    \n",
    "describeJob(job_description)\n",
    "\n",
    "soup = BeautifulSoup(job_description['content'], 'html.parser')\n",
    "text = soup.get_text()\n",
    "\n",
    "# Convert your html to markdown. This reduces tokens and noise\n",
    "text = md(text)\n",
    "\n",
    "print (text[:600])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "bb307900",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "   \"salary_range\": [\n",
      "      {\n",
      "         \"high_end\": \"165000\",\n",
      "         \"low_end\": \"112300\"\n",
      "      }\n",
      "   ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "chain = create_extraction_chain(llm, salary_range)\n",
    "output = chain.predict_and_parse(text=text)[\"data\"]\n",
    "\n",
    "printOutput(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d25f76f9",
   "metadata": {},
   "source": [
    "> The salary range for this position is $112,300 - 165,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus, restricted stock units, and benefits. These ranges are subject to change.\n",
    "\n",
    "Awesome!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6855ab40",
   "metadata": {},
   "source": [
    "[OpenAI GPT4 Pricing](https://help.openai.com/en/articles/7127956-how-much-does-gpt-4-cost)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "4e67af85",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Tokens: 1768\n",
      "Prompt Tokens: 1757\n",
      "Completion Tokens: 11\n",
      "Successful Requests: 1\n",
      "Total Cost (USD): $0.053369999999999994\n"
     ]
    }
   ],
   "source": [
    "with get_openai_callback() as cb:\n",
    "    result = chain.predict_and_parse(text=text)\n",
    "    print(f\"Total Tokens: {cb.total_tokens}\")\n",
    "    print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n",
    "    print(f\"Completion Tokens: {cb.completion_tokens}\")\n",
    "    print(f\"Successful Requests: {cb.successful_requests}\")\n",
    "    print(f\"Total Cost (USD): ${cb.total_cost}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b770a9e8",
   "metadata": {},
   "source": [
    "Suggested To Do if you want to build this out:\n",
    "\n",
    "* Reduce amount of HTML and low-signal text that gets put into the prompt\n",
    "* Gather list of 1000s of companies\n",
    "* Run through most jobs (You'll likely start to see duplicate information after the first 10-15 jobs per department)\n",
    "* Store results\n",
    "* Snapshot daily as you look for new jobs\n",
    "* Follow [Greg](https://twitter.com/GregKamradt) on Twitter for more tools or if you want to chat about this project\n",
    "* Read the user feedback below for what else to build out with this project (I reached out to everyone who signed up on twitter)\n",
    "\n",
    "\n",
    "### Business idea: Job Data As A Service\n",
    "\n",
    "Start a data service that collects information about company's jobs. This can be sold to investors looking for an edge.\n",
    "\n",
    "After posting [this tweet](https://twitter.com/GregKamradt/status/1643027796850253824) there were 80 people that signed up for the trial. I emailed all of them and most were job seekers looking for companies that used the tech they specialized in.\n",
    "\n",
    "The more interesting use case were sales teams + investors.\n",
    "\n",
    "#### Interesting User Feedback (Persona: Investor):\n",
    "\n",
    "> Hey Gregory, thanks for reaching out. <br><br>\n",
    "I always thought that job posts were a gold mine of information, and often suggest identifying targets based on these (go look at relevant job posts for companies that might want to work with you). Secondly, I also automatically ping BuiltWith from our CRM and send that to OpenAI and have a summarized tech stack created - so I see the benefit of having this as an investor. <br><br>\n",
    "For me personally, I like to get as much data as possible about a company. Would love to see job post cadence, type of jobs they post and when, notable keywords/phrases used, tech stack (which you have), and any other information we can glean from the job posts (sometimes they have the title of who you'll report to, etc.). <br><br>\n",
    "For sales people, I think finer searches, maybe even in natural language if possible - such as \"search for companies who posted a data science related job for the first time\" - would be powerful.\n",
    "\n",
    "If you do this, let me know! I'd love to hear how it goes."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
